library(igraph)
library(chorddiag)
set.seed(1234)
data = read.table("congress.edgelist", sep="", header=F,
col.names=c("V1", "V2", "delete", "weight"))
data = data[,-3]
data$weight = gsub("}", '', data$weight)
data$weight = as.numeric(data$weight)
data.attr = read.csv("gender-party.csv",sep=";", na.strings="")
data.attr$Gender = factor(data.attr$Gender)
data.attr$Party = factor(data.attr$Party)
g = igraph::graph_from_data_frame(data)
# set vertex attributes
g = set_vertex_attr(g, "name", value=data.attr$Name)
g = set_vertex_attr(g, "gender", value=data.attr$Gender)
g = set_vertex_attr(g, "affiliation", value=data.attr$Party)
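The weight cleanup above (stripping a stray `}` before coercing to numeric) can be illustrated on made-up strings; the values below are hypothetical, not taken from the real edgelist.

```r
# Hypothetical raw weight strings with the trailing-brace residue seen in the edgelist
raw <- c("0.25}", "0.0303}", "1}")

# Strip the brace and coerce to numeric, as done for data$weight
w <- as.numeric(gsub("}", "", raw))
w
```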
We can start by plotting the network using some standard graph layouts. As our network is not small, these are not expected to work particularly well. Nonetheless, they might still shed some light on the less obvious features of the influence relations between US Congresspeople.
The circle and star layouts are a good starting point. Instead of using the default star layout, which places the first vertex in the middle, we place the vertex of highest degree in the center in an attempt to achieve a clearer visualization.
par(mar=c(0,0,0,0)+.3, mfrow=c(1,2))
plot(g, layout=layout_in_circle,
vertex.label.cex=0.3,
main="Circular layout")
plot(g, layout=layout_as_star(g,center=V(g)[which(degree(g)==max(degree(g)))]),
vertex.label.cex=0.3,
main="Star layout")
As anticipated, these layouts do not work particularly well for this network due to its high order and size. Furthermore, they seem to suggest an extremely dense network, which, as we saw in the previous week through the study of the adjacency matrix, is not quite the case.
Next, the standard tree layout can be considered.
par(mar=c(0,0,0,0)+.3)
plot(g, layout=layout_as_tree,
vertex.label.cex=0.3,
edge.arrow.size=0.1,
main="Tree layout")
Once again, we find a layout that is overwhelmed by the sheer number of vertices and interactions between them. Unlike the previous layouts, however, the tree layout does provide some insight into the network’s communities, with perhaps four main communities connected by certain vertices, and even some rather isolated members of US Congress.
Finally, we can give the grid and sphere layouts a try.
par(mar=c(0,0,0,0)+.3, mfrow=c(1,2))
plot(g, layout=layout_on_grid,
vertex.label.cex=0.3,
main="Grid layout")
plot(g, layout=layout_on_sphere,
vertex.label.cex=0.3,
main="Sphere layout")
These layouts do not seem to be good options for the visualization of the network either.
All in all, simple graph layouts seem to lack the sophistication required to represent a network of this size in a clear manner, although the tree layout does allow for the gathering of some possible insights into its structure.
Some more sophisticated layouts are those obtained by maximizing energy functions. The optimization process behind these visualizations attempts to distribute vertices away from each other, keep edges short, and minimize edge crossings, all while keeping vertices from coming too close to edges.
Each vertex’s position is calculated and stored in a matrix that can later be fed to the plot.igraph function. The matrices for all the energy functions studied in this report are calculated below:
dh = layout_with_dh(g)
fr = layout_with_fr(g)
gem = layout_with_gem(g)
graphopt = layout_with_graphopt(g)
kk = layout_with_kk(g)
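Each of these calls returns an n × 2 matrix of coordinates (one row per vertex), which is the shape plot.igraph expects through its layout argument. A minimal check on a small toy graph:

```r
library(igraph)

# Any layout function returns one (x, y) row per vertex
toy <- make_ring(10)
coords <- layout_with_fr(toy)
dim(coords)   # 10 rows, 2 columns
```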
We start by studying the Davidson-Harel layout, based on a simulated annealing algorithm.
par(mar=c(0,0,0,0)+.3)
plot(g, layout=dh,
vertex.label.cex=0.2, vertex.size=10,
main="Davidson-Harel layout")
The Davidson-Harel layout seems to indicate the existence of two main communities, although not very well separated due to some vertices acting as links between them.
The next energy function studied is that of the Fruchterman-Reingold layout, a force-directed algorithm which uses an analogy of physical springs as edges that attract connected vertices toward each other, while keeping a repulsive force attempting to separate all vertices from one another.
par(mar=c(0,0,0,0)+.3)
plot(g, layout=fr,
vertex.label.cex=0.3, vertex.size=10,
main="Fruchterman-Reingold layout")
This algorithm does not yield a very clear visualization of the network.
The next algorithm to be studied is the GEM force-directed layout, which, like the previous one, uses an analogy of attractive and repulsive forces between vertices, taking the edges into account.
par(mar=c(0,0,0,0)+.3)
plot(g, layout=gem,
vertex.label.cex=0.2, vertex.size=10,
main="GEM force-directed layout")
Although once again not a very clear plot, this layout perhaps allows us to identify those vertices that are either mostly influencers or mostly influenced, since they seem to sit in the outer regions of the network. Notice, however, that said vertices seem to have smaller degrees.
To end, another force-directed algorithm is used, the Graphopt layout algorithm, as well as a different approach, the Kamada-Kawai layout algorithm.
par(mar=c(0,0,0,0)+.3, mfrow=c(1,2))
plot(g, layout=graphopt,
vertex.label.cex=0.2, vertex.size=10,
main="Graphopt layout")
plot(g, layout=kk,
vertex.label.cex=0.2, vertex.size=10,
main="Kamada-Kawai layout")
The next and final layout that will be studied is based on the statistical technique known as Multidimensional Scaling. The idea is to project points from a higher-dimensional space onto a plane while preserving the distances between points as faithfully as possible.
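The idea can be sketched in base R with cmdscale, the classical MDS routine that igraph’s layout_with_mds builds on (here applied to a made-up point cloud rather than graph distances):

```r
# Four hypothetical points in 3-D and their pairwise distances
pts <- matrix(c(0, 0, 0,
                1, 0, 0,
                0, 1, 0,
                1, 1, 1), ncol = 3, byrow = TRUE)
d <- dist(pts)

# Project onto the plane while preserving distances as well as possible
xy <- cmdscale(d, k = 2)
dim(xy)   # 4 points, 2 coordinates each
```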
mds = layout_with_mds(g)
par(mar=c(0,0,0,0)+.3)
plot(g, layout=mds,
vertex.label.cex=0.2, vertex.size=10,
main="Multidimensional Scaling layout")
In this layout we observe what could be two main communities that are not fully separated, as well as perhaps a third or even a fourth one. Furthermore, we also observe some vertices that interact less with the rest.
Since it has been able to represent together features that were observed separately in other layouts, Multidimensional Scaling (MDS) is considered the best layout. Nonetheless, the standard MDS layout is not clear enough, and so we will now attempt to obtain a clearer visualization of our network by modifying features of the vertices and edges, such as size, width, or color, guided by the attributes of the network.
We can start by adding colors to the plot. The color of the vertices will represent the Congressperson’s political affiliation.
plot(g, layout=mds,
vertex.label.cex=0.2, vertex.size=10,
edge.width=0.2,
edge.arrow.size=0.5, edge.arrow.width=0.5)
par(mar=c(0,0,0,0)+.3)
V(g)[affiliation=="Republican"]$color = "red"
V(g)[affiliation=="Democratic"]$color = "blue"
V(g)[affiliation=="Independent"]$color = "green"
plot(g, layout=mds,
vertex.label.cex=0.2, vertex.size=10)
Next, the size of the vertices can be modified according to their
degree, so that more important Congresspeople are represented by bigger
vertices.
par(mar=c(0,0,0,0)+.3)
V(g)$size = 1.5*log(degree(g))+5
plot(g, layout=mds,
vertex.label.cex=0.2)
The edges of the network can also be modified, so that their width agrees with their weight attribute. Since the influence values are generally much smaller than 1 (the default edge width), the edge widths were taken as the MinMax-scaled weights. Unfortunately, this did not produce a clearer plot either, so the assignment is left commented out.
par(mar=c(0,0,0,0)+.3)
# E(g)$width = (E(g)$weight - min(E(g)$weight)) / (max(E(g)$weight) - min(E(g)$weight))
plot(g, layout=mds,
vertex.label.cex=0.2,
edge.arrow.size=0.5, edge.arrow.width=0.5)
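The MinMax scaling referred to above maps a vector onto [0, 1]; a minimal base-R sketch on made-up weights:

```r
# Hypothetical edge weights, all much smaller than 1
w <- c(0.02, 0.10, 0.50, 0.02, 0.26)

# Min-max scaling: smallest weight -> 0, largest -> 1
w_scaled <- (w - min(w)) / (max(w) - min(w))
range(w_scaled)   # 0 1
```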
Finally, color can also be added to the edges, so that they match the colors of the political affiliation whenever the edge in question joins two members of the same political party, and are grey otherwise.
par(mar=c(0,0,0,0)+.3)
reps = V(g)[affiliation=="Republican"]
dems = V(g)[affiliation=="Democratic"]
inds = V(g)[affiliation=="Independent"]
E(g)$color = "black"
E(g)[reps %--% reps]$color = "darkred"
E(g)[dems %--% dems]$color = "darkblue"
E(g)[inds %--% inds]$color = "darkgreen"
plot(g, layout=mds,
vertex.label.cex=0.2,
edge.arrow.size=0.5, edge.arrow.width=0.5)
However, even after attempting several combinations of colors, this last step does not seem to lead to a clearer visualization.
g = delete_edge_attr(g,"color")
The best visualization of the network considered is the following one:
par(mar=c(0,0,0,0)+.3)
plot(g, layout=mds,
vertex.label.cex=0.2,
edge.arrow.size=0.5, edge.arrow.width=0.5)
We compute the order of the graph, its degree sequence, and its average degree.
gorder(g)
## [1] 475
head(degree(g),n=30)
## Tammy Baldwin John Barrasso Michael Bennet
## 46 65 62
## Marsha Blackburn Richard Blumenthal Roy Blunt
## 45 81 61
## Cory Booker John Boozman Mike Braun
## 82 82 79
## Sherrod Brown Maria Cantwell Shelley Moore Capito
## 83 69 50
## Ben Cardin Tom Carper Bob Casey Jr.
## 62 56 44
## Bill Cassidy Chris Coons John Cornyn
## 119 75 19
## Catherine Cortez Masto Tom Cotton Kevin Cramer
## 81 71 115
## Mike Crapo Ted Cruz Steve Daines
## 41 51 57
## Tammy Duckworth Dick Durbin Joni Ernst
## 123 42 42
## Dianne Feinstein Deb Fischer Kirsten Gillibrand
## 32 42 39
mean(degree(g))
## [1] 55.95368
Now, we can have a look at the degree sequence of the network. The average degree is approximately 56; therefore, on average, each congressperson has around 56 connections with other congresspeople. If we have a look at the degree distribution of the network, we conclude that it is positively skewed. Indeed, the average degree is low compared with some of the degrees, which can be as high as 284.
table(degree(g))
##
## 2 5 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## 1 2 1 1 1 2 4 2 4 2 2 3 1 5 2 3 5 4 8 4
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 10 5 4 8 10 5 7 5 14 8 10 9 9 10 11 3 6 10 5 4
## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 5 6 10 7 10 5 8 6 4 6 6 5 4 4 4 7 6 10 3 6
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
## 12 4 5 2 2 5 2 3 3 2 5 4 1 1 1 4 2 2 6 3
## 85 86 88 89 90 91 92 93 94 95 96 97 98 99 100 103 106 108 109 110
## 1 1 3 1 1 1 2 2 6 2 5 3 4 2 2 2 2 1 1 1
## 113 115 118 119 120 121 122 123 124 125 126 128 137 138 139 140 142 147 148 152
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1
## 157 163 182 187 190 261 284
## 1 2 1 1 1 1 1
We take a look at the degree distribution and we plot it to better visualize it.
head(degree_distribution(g),n=20)
## [1] 0.000000000 0.000000000 0.002105263 0.000000000 0.000000000 0.004210526
## [7] 0.000000000 0.002105263 0.002105263 0.002105263 0.004210526 0.008421053
## [13] 0.004210526 0.008421053 0.004210526 0.004210526 0.006315789 0.002105263
## [19] 0.010526316 0.004210526
color_1 <- "deepskyblue2"
par(mar=c(5,4,4,2)+0.1)
plot(0:max(degree(g)),degree_distribution(g),col=color_1,
main="Degree distribution",
xlab="Degree",ylab="Frequency",type="h",lwd=1.5)
points(0:max(degree(g)),degree_distribution(g),pch=20)
We can see that it is right-skewed. The average degree we computed, approximately 55.95, is very low compared to some values that reach as high as 284.
We check that the network is sparse. For that, we obtain the ratio between the size of the network and the maximum number of edges that could exist. Given \(V\), the number of vertices, the maximum number of edges (ignoring edge direction) is \[ \frac{V(V-1)}{2}\]
ecount(g)
## [1] 13289
vcount(g)*(vcount(g)-1)/2
## [1] 112575
ecount(g)/(vcount(g)*(vcount(g)-1)/2)
## [1] 0.1180457
The ratio is \(0.118\), meaning that only \(11.8\%\) of the potential edges actually exist.
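The arithmetic behind this ratio can be checked directly from the reported order and size:

```r
V <- 475     # gorder(g)
E <- 13289   # ecount(g)

# Maximum possible number of edges, ignoring direction
max_edges <- V * (V - 1) / 2   # 112575

# Density of the network
density <- E / max_edges
round(density, 7)
```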
Given this skewness, we plot the degree distribution in log scale, i.e., we plot the log of the degrees vs the log of the probabilities (relative frequencies). Ideally, the plot would show a linear relationship between log-frequency and log-degree, so that we can fit a linear regression afterwards.
Note that, for plotting in log scale, it is necessary to plot only those log-degrees and log-probabilities corresponding to degrees and probabilities larger than 0. Therefore the first step is to locate such probabilities. Then, we can make the plot in log-scale.
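The zero-filtering step can be sketched on a made-up distribution vector (the values below are hypothetical, not the network’s):

```r
# Hypothetical relative frequencies indexed by degree 0, 1, 2, ...
p <- c(0, 0.2, 0, 0.5, 0.3)

# Keep only strictly positive entries; subtract 1 to recover the degree
idx <- which(p != 0)
degrees <- idx - 1        # degrees with non-zero frequency
cbind(degrees, freq = p[idx])
```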
degree_dis <- degree_distribution(g)
max_degree <- max(degree(g))
degree_dis_positive <- which(degree_dis!=0)
plot(degree_dis_positive-1,degree_dis[degree_dis_positive],log="xy",
col=color_1,main="Log-log degree distribution",
xlab="Log-Degree",ylab="Log-Frequency",pch=19)
As we can observe, the right-skewness has been corrected. However, the relationship is clearly non-linear, so it makes no sense to fit a linear regression. Instead of using this degree distribution, we will try the in-degree and out-degree distributions and check whether a linear regression is more adequate for them.
# note: these store the degree sequences (one value per vertex), not distributions
in_degree_distribution <- degree(g, mode = "in", loops = FALSE)
out_degree_distribution <- degree(g, mode = "out", loops = FALSE)
color_1 <- "deepskyblue2"
par(mar=c(5,4,4,2)+0.1)
plot(0:max(degree(g, mode="in")),degree_distribution(g, mode="in"),col=color_1,
main="In-Degree distribution",
xlab="Degree",ylab="Frequency",type="h",lwd=1.5)
points(0:max(degree(g,mode="in")),degree_distribution(g,mode="in"),pch=20)
color_1 <- "deepskyblue2"
par(mar=c(5,4,4,2)+0.1)
plot(0:max(degree(g, mode="out")),degree_distribution(g, mode="out"),col=color_1,
main="Out-Degree distribution",
xlab="Degree",ylab="Frequency",type="h",lwd=1.5)
points(0:max(degree(g,mode="out")),degree_distribution(g,mode="out"),pch=20)
Again both in-degree and out-degree distributions are right skewed,
showing even more extreme skewness in the out-degree distribution.
degree_dis_in <- degree_distribution(g,mode="in")
max_degree_in <- max(degree(g,mode="in"))
degree_dis_positive_in <- which(degree_dis_in!=0)
degree_dis_out <- degree_distribution(g,mode="out")
max_degree_out <- max(degree(g,mode="out"))
degree_dis_positive_out <- which(degree_dis_out!=0)
plot(degree_dis_positive_in-1,degree_dis_in[degree_dis_positive_in],log="xy",
col=color_1,main="Log-log IN-degree distribution",
xlab="Log-Degree",ylab="Log-Frequency",pch=19)
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0 omitted from
## logarithmic plot
Now that skewness has been corrected, we can fit a power law to the log-degree distribution. To do so, we estimate a linear regression model by Weighted Least Squares (WLS), which in turn requires first estimating a linear regression by Ordinary Least Squares (OLS).
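The OLS-then-WLS scheme can be sketched on synthetic power-law data (the degrees, coefficients, and noise level below are all made up for illustration):

```r
set.seed(1)

# Synthetic power law: log(p_d) = a + b*log(d) + noise, with a = -2, b = -0.8
log_d <- log(5:100)
log_p <- -2 - 0.8 * log_d + rnorm(length(log_d), sd = 0.1)

# Step 1: OLS fit
ols <- lm(log_p ~ log_d)

# Step 2: WLS, weighting by the inverse squared OLS fitted values
wls <- lm(log_p ~ log_d, weights = 1 / ols$fitted.values^2)
coef(wls)
```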
Note that the first three values could be considered outliers, having a large effect on the fit, so we eliminate them.
plot(degree_dis_positive_in[-c(1,2,3)] - 1, degree_dis_in[degree_dis_positive_in][-c(1,2,3)], log="xy",
col=color_1, main="Log-log IN-degree distribution",
xlab="Log-Degree", ylab="Log-Frequency", pch=19)
log_deg_dist_in <- log(degree_dis_in[degree_dis_positive_in])[c(-1,-2,-3)]
log_deg_in <- log(degree_dis_positive_in-1)[c(-1,-2,-3)]
first_fit_in <- lm(log_deg_dist_in~log_deg_in)
summary(first_fit_in)
##
## Call:
## lm(formula = log_deg_dist_in ~ log_deg_in)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55115 -0.43149 0.04885 0.46304 1.11159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.50779 0.29212 -5.161 1.68e-06 ***
## log_deg_in -0.93294 0.07876 -11.845 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6018 on 82 degrees of freedom
## Multiple R-squared: 0.6311, Adjusted R-squared: 0.6266
## F-statistic: 140.3 on 1 and 82 DF, p-value: < 2.2e-16
second_fit_in <- lm(log_deg_dist_in~log_deg_in,weights=1/first_fit_in$fitted.values^2)
summary(second_fit_in)
##
## Call:
## lm(formula = log_deg_dist_in ~ log_deg_in, weights = 1/first_fit_in$fitted.values^2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.40836 -0.09346 -0.00206 0.10183 0.25245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.25816 0.23611 -9.564 5.45e-15 ***
## log_deg_in -0.72040 0.06956 -10.357 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1367 on 82 degrees of freedom
## Multiple R-squared: 0.5667, Adjusted R-squared: 0.5615
## F-statistic: 107.3 on 1 and 82 DF, p-value: < 2.2e-16
The obtained fit is \(\widehat{\log(p_d)}=-1.50779-0.93294\log(d)\), with coefficient of determination \(R^2=0.6311\). Secondly, we estimate the variances of the errors and then carry out WLS, which leads to the fitted model \(\widehat{\log(p_d)}=-2.25816-0.72040\log(d)\) with coefficient of determination \(R^2=0.5667\) (worse than before). The estimated slope is \(\hat{\alpha}=-0.72040\), indicating a moderate decrease, which appears to confirm that while most vertices have small degrees, a few vertices have larger degrees, although not extraordinarily large. Finally, we add the regression line to the plot.
color_2 <- "indianred2"
plot(log_deg_in,log_deg_dist_in,
col=color_1,main="Log-log in-degree distribution",
xlab="Log-Degree",ylab="Log-Frequency",pch=19)
abline(first_fit_in$coefficients[1],first_fit_in$coefficients[2],col=color_2,lwd=2)
We now do the same for the out-degree distribution.
plot(degree_dis_positive_out-1,degree_dis_out[degree_dis_positive_out],log="xy",
col=color_1,main="Log-log OUT-degree distribution",
xlab="Log-Degree",ylab="Log-Frequency",pch=19)
plot(degree_dis_positive_out[-c(1,2)] - 1, degree_dis_out[degree_dis_positive_out][-c(1,2)], log="xy",
col=color_1, main="Log-log OUT-degree distribution",
xlab="Log-Degree", ylab="Log-Frequency", pch=19)
log_deg_dist_out <- log(degree_dis_out[degree_dis_positive_out])[c(-1,-2)]
log_deg_out <- log(degree_dis_positive_out-1)[c(-1,-2)]
first_fit_out <- lm(log_deg_dist_out~log_deg_out)
summary(first_fit_out)
##
## Call:
## lm(formula = log_deg_dist_out ~ log_deg_out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10891 -0.63357 0.06629 0.73035 1.31798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8099 0.4342 -4.168 8.75e-05 ***
## log_deg_out -0.8266 0.1205 -6.859 2.39e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.805 on 69 degrees of freedom
## Multiple R-squared: 0.4054, Adjusted R-squared: 0.3968
## F-statistic: 47.05 on 1 and 69 DF, p-value: 2.39e-09
With ordinary least squares, we obtain the fit \(\widehat{\log p_d}=-1.8099-0.8266\log(d)\) for the out-degree distribution. The coefficient of determination is \(R^2=0.4054\). This is not a very good result, but given the non-linearity of our data, it was to be expected.
second_fit_out <- lm(log_deg_dist_out~log_deg_out,weights=1/first_fit_out$fitted.values^2)
summary(second_fit_out)
##
## Call:
## lm(formula = log_deg_dist_out ~ log_deg_out, weights = 1/first_fit_out$fitted.values^2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.51121 -0.15253 -0.01137 0.17664 0.31493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8091 0.3942 -7.125 7.89e-10 ***
## log_deg_out -0.5371 0.1171 -4.586 1.96e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1886 on 69 degrees of freedom
## Multiple R-squared: 0.2336, Adjusted R-squared: 0.2225
## F-statistic: 21.03 on 1 and 69 DF, p-value: 1.962e-05
Weighted least squares gives us the fit \(\widehat{\log p_d}=-2.8091-0.5371\log(d)\). The coefficient of determination is even worse, \(R^2=0.2336\). The estimated slope is \(\hat{\alpha}=-0.5371\), which again is a sensible result due to the sparse nature of the network.
We add the regression line to the plot.
color_2 <- "indianred2"
plot(log_deg_out,log_deg_dist_out,
col=color_1,main="Log-log out-degree distribution",
xlab="Log-Degree",ylab="Log-Frequency",pch=19)
abline(first_fit_out$coefficients[1],first_fit_out$coefficients[2],col=color_2,lwd=2)
We now repeat the analysis, this time taking into account the weights on the edges of our graph. First, we compute the weighted degree sequence of the network, as well as its average weighted degree. The strength of a vertex in a graph is defined as the sum of the weights of its incident edges.
head(strength(g))
## Tammy Baldwin John Barrasso Michael Bennet Marsha Blackburn
## 0.1998895 0.2453818 0.5740193 0.2770146
## Richard Blumenthal Roy Blunt
## 0.3422145 0.2379193
mean(strength(g))
## [1] 0.3251765
The average weighted degree of the network is 0.3252, which indicates that, on average, a congressperson has a total influence weight of around 0.32 via Twitter on the rest of the congresspeople.
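The strength computation itself can be sketched in base R on a made-up weighted edge list (vertex names and weights below are hypothetical):

```r
# Hypothetical weighted edges: a -> b (0.3), a -> c (0.2), b -> c (0.5)
edges <- data.frame(from = c("a", "a", "b"),
                    to   = c("b", "c", "c"),
                    w    = c(0.3, 0.2, 0.5))

# Strength of a vertex: sum of weights over all edges incident to it
verts <- c("a", "b", "c")
s <- sapply(verts, function(v) sum(edges$w[edges$from == v | edges$to == v]))
s   # a: 0.5, b: 0.8, c: 0.7
```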
head(table(strength(g)), n = 20)
##
## 0.0226415094339623 0.0240038738240177 0.0303030303030303 0.0428015564202335
## 1 1 1 1
## 0.0467851893505431 0.0485175202156334 0.0491707206935448 0.0517164712883939
## 1 1 1 1
## 0.0566619815822638 0.0578027421665609 0.0645186308481954 0.0660765942668869
## 1 1 1 1
## 0.0673682109327181 0.0673933048848474 0.068700849826538 0.0697648405129694
## 1 1 1 1
## 0.0722693125942843 0.0758846268212905 0.0767129802135719 0.0771400083513075
## 1 1 1 1
In particular, the weighted degree distribution of our network is positively skewed. Indeed, the average weighted degree (0.3251765) is low compared with some of the weighted degrees that can be as high as 2.4755005.
To compute the weighted degree distribution, we use a function from class called graph.strength.distribution, since igraph does not provide a function to compute such a distribution.
graph.strength.distribution <- function (graph, cumulative = FALSE, ...)
{
if (!is_igraph(graph)) {
stop("Not a graph object")
}
cs <- strength(graph, ...)
breaks <- seq(min(cs), max(cs), length.out = length(cs) + 1)
hi <- hist(cs, breaks = breaks, plot = FALSE)$density
if (!cumulative) {
res <- hi
}
else {
res <- rev(cumsum(rev(hi)))
}
res
}
The function begins by validating the input to ensure it’s a valid
igraph object. If the input isn’t valid, the function stops
with an error message.
Next, the function calculates the strength of each vertex in the
graph and stores it in a variable called cs. It then
computes the density of the strength values by creating a histogram
without plotting it. The breaks for the histogram are determined based
on the minimum and maximum strength values in cs.
Depending on the cumulative argument, the function
either returns the computed density values directly or calculates the
cumulative distribution of the strength values using the
cumsum function. If cumulative is set to
TRUE, the function reverses the order of the density values
before calculating the cumulative sum to ensure the correct cumulative
distribution is obtained.
Finally, the function returns the computed result, which is either
the density values or the cumulative distribution of the strength
values, based on the cumulative argument.
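One property worth noting: with the equal-width breaks the helper constructs, the density values returned by hist integrate to 1, so the function yields a proper density rather than raw counts. A base-R check on made-up strength values:

```r
# Hypothetical strength values
cs <- c(0.1, 0.2, 0.2, 0.4, 0.9, 1.3)

# Equal-width breaks spanning the observed range, as in the helper
breaks <- seq(min(cs), max(cs), length.out = length(cs) + 1)
hi <- hist(cs, breaks = breaks, plot = FALSE)$density

# Density times bin width sums to 1
sum(hi) * diff(breaks)[1]
```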
We use it in our weighted graph.
weighted_degree_dis <- graph.strength.distribution(g)
head(weighted_degree_dis, n = 50)
## [1] 0.8153750 0.4076875 0.0000000 0.4076875 0.4076875 1.2230626 0.8153750
## [8] 0.0000000 2.0384376 0.8153750 1.6307501 2.0384376 0.4076875 2.0384376
## [15] 2.4461251 1.2230626 0.8153750 0.4076875 1.2230626 1.2230626 4.4845628
## [22] 2.8538127 1.2230626 2.0384376 1.6307501 2.4461251 2.0384376 3.6691877
## [29] 0.8153750 4.0768752 3.2615002 4.4845628 2.0384376 0.8153750 1.6307501
## [36] 1.6307501 0.4076875 3.6691877 2.0384376 3.2615002 2.8538127 3.2615002
## [43] 3.2615002 2.4461251 2.8538127 3.2615002 4.8922503 1.2230626 1.6307501
## [50] 2.0384376
We obtain and store the maximum degree to use it later in the plotting of the distribution.
max_weighted_degree <- max(strength(g))
max_weighted_degree
## [1] 2.4755
Now we can obtain our plot.
par(mar=c(5,4,4,2)+0.1)
plot(seq(0, max_weighted_degree, length.out = length(weighted_degree_dis)), weighted_degree_dis,col=color_1,
main="Weighted degree distribution of US Congress Network",
xlab="Weighted degree",ylab="Frequency",type="h",lwd=1.5)
points(seq(0, max_weighted_degree, length.out = length(weighted_degree_dis)), weighted_degree_dis, pch=20)
We can see that it is right-skewed. Consequently, we plot the weighted degree distribution in log scale, i.e., plot log of the weighted degrees vs log of the probabilities (relative frequencies). As before, we plot only those log-weighted-degrees and log-probabilities corresponding to weighted degrees and probabilities larger than 0, which we obtain previously.
weighted_degree_dis_positive <- which(weighted_degree_dis!=0)
weighted_degree_dis_positive
## [1] 1 2 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19 20
## [19] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [37] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## [55] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
## [73] 75 76 77 78 79 80 81 82 83 84 85 86 87 89 90 93 94 95
## [91] 96 97 98 99 100 101 102 103 104 106 107 108 109 111 112 113 114 115
## [109] 116 118 120 122 123 124 127 128 133 136 142 143 144 145 147 148 150 151
## [127] 163 170 172 176 213 225 228 234 257 266 475
plot(weighted_degree_dis_positive,
weighted_degree_dis[weighted_degree_dis_positive],
log="xy",col=color_1,main="Log-log weighted degree distribution of US Congress Network",
xlab="Log-Weighted-Degree",ylab="Log-Frequency",pch=19,lwd=1.5)
We can see that several vertices have low weighted degree while another set have higher ones; the differences are not as large as in the unweighted case, and the distribution does not follow a linear model either. We now separate into the in-degree and out-degree distributions.
in_degree_weighted_dis <- graph.strength.distribution(g, mode = "in")
out_degree_weighted_dis <- graph.strength.distribution(g, mode = "out")
par(mar=c(5,4,4,2)+0.1)
plot(seq(0, max(strength(g, mode="in")), length.out = length(in_degree_weighted_dis)), in_degree_weighted_dis, col=color_1,
main="Weighted in-degree distribution of US Congress Network",
xlab="Weighted in-degree", ylab="Frequency", type="h", lwd=1.5)
points(seq(0, max(strength(g, mode="in")), length.out = length(in_degree_weighted_dis)), in_degree_weighted_dis, pch=20)
par(mar=c(5,4,4,2)+0.1)
plot(seq(0, max(strength(g, mode="out")), length.out = length(out_degree_weighted_dis)), out_degree_weighted_dis, col=color_1,
main="Weighted out-degree distribution of US Congress Network",
xlab="Weighted out-degree", ylab="Frequency", type="h", lwd=1.5)
points(seq(0, max(strength(g, mode="out")), length.out = length(out_degree_weighted_dis)), out_degree_weighted_dis, pch=20)
We can see that neither is strongly skewed. Consequently, we plot the weighted degree distributions in log scale, i.e., the log of the weighted degrees vs the log of the probabilities (relative frequencies). As before, we plot only those log-weighted-degrees and log-probabilities corresponding to weighted degrees and probabilities larger than 0, in this case for both in- and out-edges.
weighted_degree_dis_positive_in <- which(in_degree_weighted_dis!=0)
weighted_degree_dis_positive_in
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
## [73] 74 76 77 78 80 83 84 85 86 87 88 89 90 91 92 93 97 99
## [91] 101 102 103 104 105 106 107 109 110 111 112 116 117 118 119 124 125 128
## [109] 129 133 134 136 145 147 149 152 153 155 168 170 172 176 178 179 183 184
## [127] 185 205 210 235 240 275 279 475
weighted_degree_dis_positive_out <- which(out_degree_weighted_dis!=0)
weighted_degree_dis_positive_out
## [1] 1 4 7 9 10 11 12 14 15 16 18 19 20 21 22 23 24 25
## [19] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## [37] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [55] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## [73] 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
## [91] 98 99 100 101 102 103 104 105 108 109 110 111 113 115 116 117 118 119
## [109] 120 121 122 123 124 125 127 128 129 132 133 135 136 137 138 139 140 141
## [127] 144 145 146 149 150 151 156 157 159 160 162 164 165 166 168 169 172 174
## [145] 181 190 197 199 200 203 204 205 211 213 222 224 225 228 230 239 241 254
## [163] 261 265 272 304 327 350 417 438 471 475
plot(weighted_degree_dis_positive_in,
in_degree_weighted_dis[weighted_degree_dis_positive_in],
log="xy",col=color_1,main="Log-log in-weighted degree distribution of US Congress Network",
xlab="Log-in-Weighted-Degree",ylab="Log-Frequency",pch=19,lwd=1.5)
We eliminate the first four observations to improve linearity.
plot(weighted_degree_dis_positive_in[-c(1,2,3,4)],
in_degree_weighted_dis[weighted_degree_dis_positive_in][-c(1,2,3,4)],
log="xy",col=color_1,main="Log-log in-weighted degree distribution of US Congress Network",
xlab="Log-in-Weighted-Degree",ylab="Log-Frequency",pch=19,lwd=1.5)
plot(weighted_degree_dis_positive_out,
out_degree_weighted_dis[weighted_degree_dis_positive_out],
log="xy",col=color_1,main="Log-log out-weighted degree distribution of US Congress Network",
xlab="Log-out-Weighted-Degree",ylab="Log-Frequency",pch=19,lwd=1.5)
We fit the linear regressions.
log_weighted_deg_dist_in <- log(in_degree_weighted_dis[weighted_degree_dis_positive_in])[-c(1,2,3,4)]
log_weighted_deg_in <- log(weighted_degree_dis_positive_in)[-c(1,2,3,4)]
first_fit_w_in <- lm(log_weighted_deg_dist_in~log_weighted_deg_in)
summary(first_fit_w_in)
##
## Call:
## lm(formula = log_weighted_deg_dist_in ~ log_weighted_deg_in)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07961 -0.35620 -0.05548 0.39734 1.02402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.60677 0.19818 18.20 <2e-16 ***
## log_weighted_deg_in -0.77600 0.04698 -16.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.472 on 128 degrees of freedom
## Multiple R-squared: 0.6807, Adjusted R-squared: 0.6782
## F-statistic: 272.9 on 1 and 128 DF, p-value: < 2.2e-16
plot(log_weighted_deg_in,log_weighted_deg_dist_in,
col=color_1,main="Log-log in-weighted degree distribution of US Congress network",
xlab="First Log-in-Weighted-Degree",ylab="Log-Frequency",pch=19)
abline(first_fit_w_in$coefficients[1],first_fit_w_in$coefficients[2],col=color_2,lwd=2)
log_weighted_deg_dist_out <- log(out_degree_weighted_dis[weighted_degree_dis_positive_out])
log_weighted_deg_out <- log(weighted_degree_dis_positive_out)
first_fit_w_out <- lm(log_weighted_deg_dist_out~log_weighted_deg_out)
summary(first_fit_w_out)
##
## Call:
## lm(formula = log_weighted_deg_dist_out ~ log_weighted_deg_out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11503 -0.53047 -0.01195 0.47997 1.41451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17535 0.24421 8.908 7.75e-16 ***
## log_weighted_deg_out -0.30859 0.05442 -5.671 5.99e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6554 on 170 degrees of freedom
## Multiple R-squared: 0.1591, Adjusted R-squared: 0.1541
## F-statistic: 32.16 on 1 and 170 DF, p-value: 5.993e-08
plot(log_weighted_deg_out,log_weighted_deg_dist_out,
col=color_1,main="Log-log out-weighted degree distribution of US Congress network",
xlab="Log-out-Weighted-Degree",ylab="Log-Frequency",pch=19)
abline(first_fit_w_out$coefficients[1],first_fit_w_out$coefficients[2],col=color_2,lwd=2)
The in-degree distribution of our graph provides valuable insights into its structure, especially when considering weighted edges. Our weighted in-degree distribution captures the importance or significance of nodes based on the sum of weights of incoming edges, providing a nuanced understanding of node centrality in the network.
On the other hand, the out-degree distribution’s behavior with weighted edges is different. The linear regression for the out-degree distribution does not fit the data as well when considering weights. This discrepancy could arise due to various reasons. For instance, the weights might introduce noise or the way weights are assigned could influence the distribution differently than simple counts of outgoing edges.
In summary, while our weighted in-degree distribution offers a richer understanding of the network’s structure, the linear regression of our weighted out-degree distribution does not provide a better fit than its unweighted counterpart. This highlights the importance of carefully considering the nature and implications of edge weights when analyzing graph structures.